Mix-up training approaches have proven effective in improving the generalization ability of Deep Neural Networks. Over the years, the research community has expanded mix-up methods in two directions, with extensive efforts devoted to improving saliency-guided procedures but minimal attention paid to the random path, leaving the randomization domain under-explored. In this paper, inspired by the complementary strengths of the two directions, we introduce a novel method that lies at the junction of the two routes. By combining the best elements of randomness and saliency utilization, our method balances speed, simplicity, and accuracy. We name our method R-Mix, following the concept of "Random Mix-up". We demonstrate its effectiveness in generalization, weakly supervised object localization, calibration, and robustness to adversarial attacks. Finally, to address the question of whether a better decision protocol exists, we train a Reinforcement Learning agent that decides the mix-up policy based on the classifier's performance, reducing dependency on human-designed objectives and hyperparameter tuning. Extensive experiments further show that the agent is capable of performing at the cutting-edge level, laying the foundation for a fully automatic mix-up. Our code is released at https://github.com/minhlong94/Random-Mixup.
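As a rough illustration of how randomness and saliency can be combined in mix-up, the sketch below mixes a batch with a shuffled copy of itself and randomly picks between plain input-level mixing and a saliency-weighted variant; the saliency proxy, the mixing policy, and all hyperparameters are assumptions for illustration, not R-Mix's actual procedure.

```python
import torch

def random_or_saliency_mix(x, y, alpha=1.0, p_random=0.5):
    """Mix a batch (x, y) with a shuffled copy of itself.

    With probability p_random use plain input-level mix-up; otherwise
    weight the mix by a per-pixel "saliency" proxy. The proxy here is
    just the input magnitude -- a stand-in, not the paper's procedure.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x2, y2 = x[perm], y[perm]

    if torch.rand(1).item() < p_random:
        # Plain mix-up: convex combination of the two inputs.
        x_mix = lam * x + (1 - lam) * x2
    else:
        # Saliency-weighted mix: pixels the proxy deems important in x
        # keep more of x, the rest leans toward x2.
        sal = x.abs().mean(dim=1, keepdim=True)              # (B, 1, H, W)
        sal = sal / (sal.amax(dim=(2, 3), keepdim=True) + 1e-8)
        w = lam * sal                                         # per-pixel weight
        x_mix = w * x + (1 - w) * x2

    # Labels are mixed with the same global coefficient.
    return x_mix, lam * y + (1 - lam) * y2

# usage: x of shape (B, C, H, W), y one-hot of shape (B, num_classes)
x = torch.randn(8, 3, 32, 32)
y = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), 10).float()
x_mix, y_mix = random_or_saliency_mix(x, y)
```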
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%), and 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
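To make the most commonly reported workaround above concrete, the following minimal sketch shows patch-based processing of a volume that is too large to handle at once; the patch size and random sampling scheme are illustrative assumptions, not anything prescribed by the survey or its participants.

```python
import numpy as np

def sample_patches(volume, patch_size=(64, 64, 64), n_patches=8, rng=None):
    """Randomly crop fixed-size 3D patches from a volume that is too
    large to be processed as a whole."""
    rng = rng or np.random.default_rng()
    patches = []
    for _ in range(n_patches):
        starts = [rng.integers(0, s - p + 1) for s, p in zip(volume.shape, patch_size)]
        slices = tuple(slice(st, st + p) for st, p in zip(starts, patch_size))
        patches.append(volume[slices])
    return np.stack(patches)

# usage: a synthetic 256^3 scan stands in for a real image volume
scan = np.random.rand(256, 256, 256).astype(np.float32)
batch = sample_patches(scan)          # shape (8, 64, 64, 64)
```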
Video understanding is a growing field and a subject of intense research, encompassing many tasks for understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, owing to the long and complicated temporal structure of unconstrained videos. Different from existing approaches, which apply a pre-trained backbone network as a black box to extract visual representations, our approach aims to extract the most contextual information with an explainable mechanism. As we observe, humans typically perceive a video through the interactions between three main factors: the actors, the relevant objects, and the surrounding environment. It is therefore crucial to design a contextual, explainable video representation extraction mechanism that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human-perception-based contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.
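A minimal sketch of the three-factor idea is given below: actor, object, and environment features are treated as three tokens and fused with self-attention so that their pairwise relations can be modeled; the feature extractors, dimensions, and fusion choice are placeholders, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class ContextualFusion(nn.Module):
    """Toy fusion of actor, object, and environment features with
    self-attention over the three factors. Feature extractors and
    dimensions are placeholders, not the paper's architecture."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, actor_feat, object_feat, env_feat):
        # Each input: (batch, dim). Stack into a 3-token sequence so the
        # attention layer can model pairwise actor/object/environment relations.
        tokens = torch.stack([actor_feat, object_feat, env_feat], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        # Pool the three contextualized tokens into one video-level vector.
        return self.proj(fused.mean(dim=1))

# usage with random stand-in features
b, d = 2, 256
fusion = ContextualFusion(dim=d)
video_repr = fusion(torch.randn(b, d), torch.randn(b, d), torch.randn(b, d))
```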
The biomedical imaging world is notorious for working with small amounts of data, frustrating state-of-the-art efforts from the computer vision and deep learning communities. With large datasets, it is easier to make the kind of progress seen in the natural image domain. The same holds for microscopy videos of neuron cells moving in a culture. This problem presents several challenges, as it can be difficult to grow and maintain a culture for days, and acquiring the materials and equipment is expensive. In this work, we explore how to alleviate this data scarcity problem by synthesizing the videos. We therefore adapt recent work on video diffusion models to synthesize videos of cells from our training dataset. We then analyze the model's strengths and consistent shortcomings to guide us in improving video generation to be as high-quality as possible. To improve on this task, we propose modifying the denoising function and adding motion information (dense optical flow) so that the model has more context regarding how video frames transition over time and how each pixel changes over time.
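Below is a minimal sketch of the proposed conditioning idea, under the assumption that dense optical flow is simply concatenated to the noisy frames as extra input channels to the denoiser; the tiny 3D convolutional stack stands in for a real video diffusion backbone and is not the paper's actual model.

```python
import torch
import torch.nn as nn

class FlowConditionedDenoiser(nn.Module):
    """Minimal sketch of a denoiser that sees dense optical flow as extra
    input channels. The small conv stack is a placeholder for a real video
    diffusion backbone; channel counts are assumptions."""
    def __init__(self, frame_ch=3, flow_ch=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(frame_ch + flow_ch, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, frame_ch, 3, padding=1),
        )

    def forward(self, noisy_frames, flow):
        # noisy_frames: (B, 3, T, H, W); flow: (B, 2, T, H, W) dense optical
        # flow between consecutive frames (last frame padded or repeated).
        x = torch.cat([noisy_frames, flow], dim=1)
        return self.net(x)   # predicted noise (or clean frames), per the chosen loss

# usage with random tensors standing in for real clips and precomputed flow
noisy = torch.randn(1, 3, 8, 64, 64)
flow = torch.randn(1, 2, 8, 64, 64)
pred = FlowConditionedDenoiser()(noisy, flow)
```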
Although unsupervised domain adaptation methods have achieved remarkable performance in semantic scene segmentation for visual perception in self-driving cars, these approaches remain impractical in real-world use cases. In practice, segmentation models may encounter new data that have not been seen before. Moreover, the data previously used to train the segmentation models may be inaccessible due to privacy concerns. To address these problems, in this work we propose a Continual Unsupervised Domain Adaptation (CONDA) approach that allows the model to continuously learn and adapt as new data arrive. Our proposed approach is also designed without requiring access to previous training data. To avoid the catastrophic forgetting problem and maintain the performance of the segmentation models, we present a novel Bijective Maximum Likelihood loss that constrains shifts in the predicted segmentation distribution. Experimental results on the continual unsupervised domain adaptation benchmark show the advanced performance of the proposed CONDA method.
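The following sketch illustrates one way a bijective map can impose a maximum-likelihood constraint on predicted segmentation features: an elementwise affine flow sends them to a standard Gaussian, and the negative log-likelihood penalizes distribution shift. This is only an illustration of the idea, not the paper's Bijective Maximum Likelihood loss.

```python
import torch
import torch.nn as nn

class BijectiveNLL(nn.Module):
    """Sketch of a maximum-likelihood constraint through a bijective map:
    an elementwise affine flow sends prediction features to a standard
    Gaussian, and the negative log-likelihood penalizes predictions whose
    distribution drifts away from that reference."""
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, feats):
        # feats: (N, dim) per-pixel prediction features (e.g. class logits).
        z = (feats - self.shift) * torch.exp(-self.log_scale)   # bijective map
        log_det = -self.log_scale.sum()                          # Jacobian term
        log_prob = -0.5 * (z ** 2 + torch.log(torch.tensor(2 * torch.pi))).sum(dim=1)
        return -(log_prob + log_det).mean()                      # NLL to minimize

# usage: logits flattened to (num_pixels, num_classes)
loss = BijectiveNLL(dim=19)(torch.randn(1024, 19))
```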
Modern Review Helpfulness Prediction systems rely on multiple modalities, typically texts and images. Unfortunately, these contemporary approaches pay scant attention to polishing representations of cross-modal relations and tend to suffer from inferior optimization, which can harm the model's predictions in numerous cases. To overcome these issues, we propose Multimodal Contrastive Learning for the Multimodal Review Helpfulness Prediction (MRHP) problem, concentrating on the mutual information between input modalities to explicitly elaborate cross-modal relations. In addition, we introduce an Adaptive Weighting scheme for our contrastive learning approach to increase flexibility in optimization. Lastly, we propose a Multimodal Interaction module to address the unaligned nature of multimodal data, thereby assisting the model in producing more reasonable multimodal representations. Experimental results show that our method outperforms prior baselines and achieves state-of-the-art results on two publicly available benchmark datasets for the MRHP problem.
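As a hedged sketch of the contrastive component, the snippet below computes an InfoNCE-style loss between paired text and image embeddings and lets per-pair weights stand in for the adaptive weighting scheme; the temperature, weighting rule, and names are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def weighted_infonce(text_emb, image_emb, weights=None, temperature=0.07):
    """Contrastive objective between paired text and image embeddings,
    with optional per-pair weights standing in for an adaptive weighting
    scheme. Details are illustrative assumptions."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature              # (B, B) similarities
    targets = torch.arange(text_emb.size(0))                     # matched pairs on the diagonal
    loss = F.cross_entropy(logits, targets, reduction="none")    # per-pair loss
    if weights is not None:
        loss = loss * weights                                    # adaptive weighting
    return loss.mean()

# usage with random stand-in embeddings for a batch of 4 reviews
loss = weighted_infonce(torch.randn(4, 128), torch.randn(4, 128),
                        weights=torch.tensor([1.0, 0.5, 2.0, 1.0]))
```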
Robots have been brought to work close to humans in many scenarios. For coexistence and collaboration, robots should be safe and pleasant for humans to interact with. To this end, a robot could be physically soft and equipped with multimodal sensing/perception, so that it has better awareness of the surrounding environment and can respond properly to human actions and intentions. This paper introduces a novel soft robotic link, named ProTac, that possesses multiple sensing modes, tactile and proximity sensing, based on computer vision and a functional material. These modalities come from a layered structure of a soft transparent silicone skin, a polymer dispersed liquid crystal (PDLC) film, and reflective markers. The PDLC film can actively switch between an opaque and a transparent state, from which tactile sensing and proximity sensing can be obtained using cameras built solely inside the ProTac link. In this paper, inference algorithms for tactile and proximity perception are introduced. Evaluation results on the two sensing modalities demonstrate that, with a simple activation strategy, the ProTac link can effectively perceive useful information from both approaching and in-contact obstacles. The proposed sensing device is expected to pave the way for the design of robots with softness, whole-body and multimodal sensing, and safety control strategies.
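A toy sketch of the alternating activation strategy is given below: the PDLC layer is driven opaque for marker-based tactile sensing and transparent for proximity sensing, using the same internal cameras. Every callable in the sketch is a hypothetical placeholder for real device drivers and perception models, not ProTac's actual API.

```python
import time

def protac_sensing_loop(read_camera, set_pdlc, classify_tactile, estimate_proximity,
                        period_s=0.05):
    """Alternate the PDLC state and collect one tactile and one proximity
    reading per cycle. All four callables are hypothetical placeholders."""
    while True:
        set_pdlc("opaque")                  # markers visible -> tactile mode
        tactile = classify_tactile(read_camera())
        set_pdlc("transparent")             # see through the skin -> proximity mode
        proximity = estimate_proximity(read_camera())
        yield tactile, proximity            # hand both readings to the controller
        time.sleep(period_s)

# usage with trivial stubs standing in for real drivers and models
loop = protac_sensing_loop(read_camera=lambda: None,
                           set_pdlc=lambda state: None,
                           classify_tactile=lambda img: "no-contact",
                           estimate_proximity=lambda img: float("inf"))
print(next(loop))
```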
Recent studies on adversarial images have shown that they tend to leave the underlying low-dimensional data manifold, making them significantly more challenging for current models to make correct predictions. This so-called off-manifold conjecture has inspired a novel line of defenses against adversarial attacks on images. In this study, we find a similar phenomenon occurs in the contextualized embedding space induced by pretrained language models, in which adversarial texts tend to have their embeddings diverge from the manifold of natural ones. Based on this finding, we propose Textual Manifold-based Defense (TMD), a defense mechanism that projects text embeddings onto an approximated embedding manifold before classification. It reduces the complexity of potential adversarial examples, which ultimately enhances the robustness of the protected model. Extensive experiments show that our method consistently and significantly outperforms previous defenses under various attack settings without trading off clean accuracy. To the best of our knowledge, this is the first NLP defense that leverages the manifold structure against adversarial attacks. Our code is available at \url{https://github.com/dangne/tmd}.
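A minimal sketch of the projection step, assuming the natural-embedding manifold is approximated by a trained generator: the latent code is optimized so that the generated embedding matches the input, and the projection is classified instead of the raw embedding. The generator, optimizer settings, and dimensions are placeholders rather than the paper's exact construction.

```python
import torch
import torch.nn as nn

def project_to_manifold(embedding, generator, latent_dim=64, steps=100, lr=0.1):
    """Project an embedding onto the range of a generator approximating the
    natural-embedding manifold by optimizing a latent code."""
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((generator(z) - embedding) ** 2).mean()   # distance to a manifold point
        loss.backward()
        opt.step()
    return generator(z).detach()                          # classify this instead

# usage: a tiny MLP stands in for a generator trained on natural embeddings
gen = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 768))
projected = project_to_manifold(torch.randn(1, 768), gen)
```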
The sentiment analysis task has various applications in practice. In sentiment analysis, words and phrases that express positive and negative emotions are important. Identifying the words in a text that convey emotion can improve the performance of classification models for the sentiment analysis task. In this paper, we propose a methodology that combines an emotion lexicon with a classification model to enhance model accuracy. Our experimental results show that combining the emotion lexicon with the classification model improves model performance.
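Below is a small sketch of one way to combine an emotion lexicon with a classifier, by appending a lexicon-based polarity score to TF-IDF features; the toy lexicon, feature concatenation, and classifier choice are illustrative assumptions, not the paper's exact methodology.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A tiny illustrative lexicon; real emotion lexicons are far larger.
LEXICON = {"great": 1.0, "love": 1.0, "terrible": -1.0, "hate": -1.0}

def lexicon_scores(texts):
    """One lexicon-based polarity score per document."""
    return np.array([[sum(LEXICON.get(w, 0.0) for w in t.lower().split())]
                     for t in texts])

def featurize(texts, vectorizer):
    # Concatenate TF-IDF features with the lexicon score column.
    return sp.hstack([vectorizer.transform(texts), lexicon_scores(texts)])

# usage: a minimal train/predict round-trip on toy data
texts = ["I love this movie, great acting", "terrible plot, I hate it"]
labels = [1, 0]
vec = TfidfVectorizer().fit(texts)
clf = LogisticRegression().fit(featurize(texts, vec), labels)
print(clf.predict(featurize(["what a great film"], vec)))
```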
Existing state-of-the-art methods for 3D point cloud instance segmentation rely on a grouping-based approach that groups points to obtain object instances. Despite improvements in producing accurate segmentation results, these methods lack scalability and commonly require dividing large inputs into multiple parts. To process a scene of millions of points, the existing fastest method, SoftGroup~\cite{vu2022softgroup}, requires tens of seconds, which is unsatisfactory. Our finding is that $k$-Nearest Neighbor ($k$-NN), which serves as a prerequisite for grouping, is the computational bottleneck. This bottleneck severely worsens the inference time for scenes with a large number of points. This paper proposes SoftGroup++ to address this computational bottleneck and to further optimize the inference speed of the whole network. SoftGroup++ is built upon SoftGroup and differs in three important aspects: (1) it performs octree $k$-NN instead of vanilla $k$-NN to reduce the time complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$; (2) it performs pyramid scaling, which adaptively downsamples the backbone outputs to reduce the search space for $k$-NN and grouping; and (3) it performs late devoxelization, which delays the conversion from voxels to points towards the end of the model so that intermediate components operate at low computational cost. Extensive experiments on various indoor and outdoor datasets demonstrate the efficacy of the proposed SoftGroup++. Notably, SoftGroup++ processes large scenes of millions of points in a single forward pass without dividing the input into multiple parts, thereby enriching contextual information. In particular, SoftGroup++ achieves a 2.4-point AP$_{50}$ improvement while being $6\times$ faster. Code and trained models will be publicly available.
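The core speed-up is easy to illustrate with any spatial index; the sketch below contrasts vanilla $\mathcal{O}(n^2)$ $k$-NN with a tree-based query, using a SciPy KD-tree as a stand-in for the octree used in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_bruteforce(points, k=8):
    """Vanilla k-NN: the pairwise distance matrix costs O(n^2) time and memory."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, 1:k + 1]   # skip self at index 0

def knn_tree(points, k=8):
    """Tree-based k-NN (a KD-tree here, standing in for the octree in the
    paper): build in O(n log n) and query neighbors without the n x n matrix."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)        # first neighbor is the point itself
    return idx[:, 1:]

# usage: the tree version scales to point clouds where building the full
# n x n distance matrix becomes prohibitively expensive
pts = np.random.rand(20000, 3).astype(np.float32)
neighbors = knn_tree(pts, k=8)
```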